Introduction

Lecture notes based on texts by Coronel, Widom, Ullman, and Silberschatz.

Admin

Briefly discuss Logistics and expectations of undergrads and grads.

Data or Information

italics are discussion points

what’s the difference?

what are examples of data?

what about examples of information?

Data

raw facts
unprocessed data
can’t tell what it means or how to interpret/use it

Information

result of processing data with some purpose in mind
- salary data and interest data become income information
can reveal the meaning of data
- temperature and humidity alone can’t give you wind chill - you need wind speed
nonsense may be the result of combining data in unexpected ways
- trying to combine average low temperature and shoe size to predict the apple harvest

Why databases?

Discuss the “Horror Story” reading

Databases to the rescue

Shared, integrated computer structure that stores a collection of:

End-user data
- Raw facts of interest to end user
Metadata: Data about data, which the end-user data are integrated and managed
- Describe data characteristics and relationships

DBMS is a set of applications/tools, managing access and maintenance of the data in the DB

Example ERD diagram

between the users and the database/data
mediates between users, data owners, and managers
provides views into the data, without exposing complexity of HW/SW
optimizes access when possible

DBMS

Ask students for examples of databases they have used.

Using a plain filesystem as a DB is unwieldy

Aaron’s problem
File systems might use incompatible formats
Duplication of data in different files/filesystems hard to detect and lead to anomalies of three types:
1. update anomalies
2. insertion anomalies
3. deletion anomalies
Duplication of data is also bad for security & privacy
- where are the data you want to protect or audit?
Different programs to access different sets of data
“Atomicity” of updates is tough
Access control is complicated
Concurrency of access

Example: write on board
1. Alice and Bob have a joint account, both have ATM cards
2. Shopping separately, both decide to get some cash
3. Each visits an different ATM and checks the balance ($100) - each ATM receives her account status
4. Each decides to withdraw $60, since clearly they have the funds.
5. The first one to complete the transaction gets the cash, while the other either:
  - gets an “insufficient funds” error even though he/she had checked
  - gets the cash and the banks ends up charging them an overdraft fee

Advantages of DBMS - address all those problems

Much improved data consistency, accuracy, timeliness
Better access control, auditable, and consistent
Anomaly detection
Consistent, portable interface, even across networks
Safe: DB’s built to be safe against (1) From system failures (2) From malicious users
Convenient: Simple commands to – find a specific book, list all books in a certain category and price range, generate an order history, produce sales figures grouped by state, etc.

making unpredicted, or Ad Hoc, queries easy
Efficient: Don’t search all files in order to
- get price of one book,
- get all customers from northern California,
- get bestselling books from last week.
Massive data → DBMS’s carefully tuned for performance

Disadvantages of DBMS

Higher costs
Shifts complexity to management
Vendor dependence (usually)
- e.g., we’ll be using mySQL which isn’t 100% compatible with Oracle which isn’t 100% compatible with Microsoft SQL Server which isn’t 100% compatible with IBM’s DB2
Complex software like DBMS have updates and patches, while filesystems have far fewer

Kinds of databases

Single or multi-user
Enterprise or group
Centralized or distributed (even to the Cloud)
Operational vs. analytical (data warehouses)

Example database: online bookseller

Data = information on books (including categories, bestsellers, etc.), customers, pending orders, order histories, trends and preferences, etc.

Massive: hundreds of gigabytes at a minimum for medium-size online bookseller, and even more …
- keep all order histories over all time,
- keep clickstream logs,
- keep images of book covers and sample pages
Far too big for memory
Persistent: data outlives programs that operate on it
Multi-user: many people/programs accessing same database, or even the same data, simultaneously

Need careful controls

Appears similar to concurrent programming problems (synchronization, semaphores, etc.)

BUT: the data is not in main-memory variables, it’s in a storage device

Appears similar to file system concurrent access

BUT: want to control at smaller granularity

Also, databases may need to be distributed or replicated around the Internet

for network performance
for load balancing
for regulatory restrictions

Database history

Early 50’s and 60’s
- magnetic tapes
  - sequential only
  - updates required copy/rewrite
  - still used today for special situations (seismic DP, etc.)
60’s - 70’s
- Hard disks allowed direct access to data
- networking access began
1970 Edgar Codd made a breakthrough, defining the relational DB
- See paper on Resources page
- Won Turing award for it
- Data Independence (between logical and physical implementations)
- Concept of Normalization, used to avoid data anomolies
- like redundancy, data loss, and integrity
- Relational model and set-based query language like SQL
80’s brought
- full fledged SQL standard
- Object-Oriented DB’s
90’s
- Multi-terabyte Data warehouses enabling data mining
- Web commerce
2000’s
- Data Lakes
  - A data lake is a vast pool of raw data, the purpose for which is not yet defined. “Just back up the truck…”
  - A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
  - If poorly managed, a data lake becomes a “data swamp”.
- XML representations arrived
- NoSQL Alternatives to RDBMS, like MongoDB, Cassandra, Redis, CouchDB, and others.

Popular DB products

Stack Overflow Developer Survey Results 2024 provides an interesting view into the SW development world. Their description:

Each year since 2011, Stack Overflow has asked developers about their favorite technologies, coding habits, and work preferences, as well as how they learn, share, and level up. This year represents the largest group of respondents in our history: over 70,000 developers took our annual survey in January.

Let’s have a look at some of the report.

SQL is behind JS, HTML, and Python for most popular programming language among professional developers
MongoDB was the most popular NOSQL DB

Relational DB’s

In addition to MySQL (managed by Oracle) there is:

MySQL was the 2nd favorite DB, behind PostgresSQL.
- Postgres (“Post Ingres”, from Stonebraker who later won the Turing award)
Oracle (who owns MySQL) has their own DB
Microsoft: SQL Server, Access
IBM: DB2, Informix

NoSQL DB’s

Large-scale distributed DB system that stores structured and unstructured data in efficient ways.

Mongo DB (document-oriented DB, most popular NOSQL DB)
Redis (in-memory data structure store, used as a database, cache and message broker)
CouchDB (key-value)
Cassandra (hybrid key-value & column-oriented DB)
Neo4j (graph DB)

Towards the end of the course we will study these NOSQL databases, in general, and MongoDB in depth.